[ML] AIOps: @kbn/aiops-api-plugin #179695

Draft · walterra wants to merge 31 commits into main

Conversation

walterra (Contributor) commented Mar 29, 2024

Summary

DRAFT / POC / WORK IN PROGRESS

Implements #178613.

This PR creates a new plugin @kbn/aiops-api-plugin. Its purpose is to host AIOps related functionality to be consumed by the Kibana stack and its solutions. In contrast, the APIs in @kbn/aiops-plugins should be considered completely internal; they are implementation details of UI components in the ML UI. The separate plugin isolates these APIs so we don't run into problems with cyclical dependencies.

This version of Log Rate Analysis at a minimum just needs an index name, a time range, and the time field name of the index (TODO: still needs support for an optional query to limit the scope of the analysis). It will then run the following analysis:

  • Change point detection runs over the time range. Based on detected spikes/dips, it identifies the baseline and deviation time ranges for the analysis.
  • Using field caps, it identifies fields suitable for analysis. If you know the fields you want to analyse up front, you can provide them and this step will be skipped.
  • The significant terms aggregation with the p-value option is used to identify statistically significant keyword-like fields (see the sketch after this list).
  • Text categorization, combined with custom code we also use for data drift detection, is used to identify statistically significant patterns in text fields.
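
For illustration, here is a minimal sketch of what the significant terms step could look like. This is an assumption for illustration, not the PR's actual query: the index, field name, and window boundaries are made up, and it relies on the p_value significance heuristic available in recent Elasticsearch versions.

import type { ElasticsearchClient } from '@kbn/core/server';

// Hedged sketch: rank terms in the deviation window against a baseline
// background filter using the `p_value` significance heuristic.
async function fetchSignificantServiceNames(
  esClient: ElasticsearchClient,
  windows: {
    baselineMin: number;
    baselineMax: number;
    deviationMin: number;
    deviationMax: number;
  }
) {
  return esClient.search({
    index: '.ds-filebeat-8.2.0-2022.06.07-000082',
    size: 0,
    // Foreground set: documents in the deviation window.
    query: {
      range: {
        '@timestamp': {
          gte: windows.deviationMin,
          lte: windows.deviationMax,
          format: 'epoch_millis',
        },
      },
    },
    aggs: {
      significant_service_names: {
        significant_terms: {
          field: 'service.name',
          // Background set: documents in the baseline window.
          background_filter: {
            range: {
              '@timestamp': {
                gte: windows.baselineMin,
                lte: windows.baselineMax,
                format: 'epoch_millis',
              },
            },
          },
          // Score terms by statistical significance rather than the default heuristic.
          p_value: { background_is_superset: false },
        },
      },
    },
  });
}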

The analysis then returns some metadata about the change point, the significant items, as well as the date histogram used to identify the change point. This can then be used to populate custom UIs or be handed off to an LLM.

The PoC in this PR exposes Log Rate Analysis in three different ways: as a plain async function, as a Kibana REST API endpoint, and as a registered function for the O11y AI Assistant. Note that the naming of functions/endpoints etc. should be considered experimental and up for discussion; feedback is very welcome.

Plain async function

This allows log rate analysis to be included in your own Kibana server-side code, for example as part of a custom REST API endpoint.

/**
 * Fetches log rate analysis data.
 *
 * @param esClient Elasticsearch client.
 * @param index The Elasticsearch source index pattern.
 * @param start The start of the time range, in Elasticsearch date math, like `now-24h`.
 * @param end The end of the time range, in Elasticsearch date math, like `now`.
 * @param timefield The Elasticsearch source index pattern time field.
 * @param abortSignal Abort signal.
 * @param keywordFieldCandidates Optional keyword field candidates.
 * @param textFieldCandidates Optional text field candidates.
 * @returns Log rate analysis data.
 */
export const fetchSimpleLogRateAnalysis = async (
  esClient: ElasticsearchClient,
  index: string,
  start: string,
  end: string,
  timefield: string,
  abortSignal?: AbortSignal,
  keywordFieldCandidates: string[] = [],
  textFieldCandidates: string[] = []
) => { ... }
import { fetchSimpleLogRateAnalysis } from '@kbn/aiops-log-rate-analysis/queries/fetch_simple_log_rate_analysis';

...

const logRateAnalysis = await fetchSimpleLogRateAnalysis(
  client,
  '.ds-filebeat-8.2.0-2022.06.07-000082',
  'Jun 7, 2022 @ 03:50:26.914',
  'Jun 7, 2022 @ 10:00:31.928',
  '@timestamp',
  abortSignal,
  ['service.name'],
  ['message']
);
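
The returned object mirrors the REST API response shown in the next section. A hypothetical follow-up (property names assumed from that response shape) could look like this:

console.log(
  `Detected a log rate ${logRateAnalysis.logRateChange.type} with ` +
    `${logRateAnalysis.significantItems.length} significant item(s).`
);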

Kibana REST API endpoint

The REST API endpoint just wraps the function from above. Usage in Kibana Dev Tools looks like this:

POST kbn:/internal/aiops/simple_log_rate_analysis
{
    "index": ".ds-filebeat-8.2.0-2022.06.07-000082",
    "start": "Jun 7, 2022 @ 03:50:26.914",
    "end": "Jun 7, 2022 @ 10:00:31.928",
    "timefield": "@timestamp",
    "keywordFieldCandidates": ["service.name"]
}

This is the response you'll get:

{
  "logRateChange": {
    "type": "spike",
    "timestamp": 1654586417118,
    "logRateChangeCount": 103491,
    "averageLogRateCount": 9266,
    "logRateAggregationIntervalUsedForAnalysis": "5 minutes",
    "documentSamplingFactorForAnalysis": 0.1,
    "extendedChangePoint": {
      "startTs": 1654586121051,
      "endTs": 1654587009252
    }
  },
  "significantItems": [
    {
      "field": "service.name",
      "value": "postgres",
      "type": "metadata",
      "documentCount": 156280,
      "baselineCount": 60680,
      "logRateChangeSort": 2.5754779169413315,
      "logRateChange": "2.58x increase"
    }
  ],
  "dateHistogramBuckets": {
    "1654566580629": 4287,
    "1654566876696": 5411,
    "1654567172763": 5215,
    "1654567468830": 4691,
...
    "1654579903644": 7708,
    "1654580199711": 7059,
    "1654588193520": 1996,
    "1654588489587": 12337,
    "1654588785654": 3082
  },
  "windowParameters": {
    "baselineMin": 1654579692171,
    "baselineMax": 1654586121051,
    "deviationMin": 1654586417118,
    "deviationMax": 1654587009252
  }
}

Some of the attribute names may seem a bit verbose, but that wording seems to help the AI Assistant make better sense of what it has to interpret. The type attribute for significant items is returned as `metadata` for keyword-like fields and `log message pattern` for text fields, again to give LLMs a better hint than just keyword/text.
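
For reference, a hypothetical TypeScript shape for the response, derived from the example payload above rather than taken from the PR's source:

// Hypothetical response shape, derived from the example payload above.
interface SimpleLogRateAnalysisResponse {
  logRateChange: {
    type: 'spike' | 'dip';
    timestamp: number;
    logRateChangeCount: number;
    averageLogRateCount: number;
    logRateAggregationIntervalUsedForAnalysis: string;
    documentSamplingFactorForAnalysis: number;
    extendedChangePoint: { startTs: number; endTs: number };
  };
  significantItems: Array<{
    field: string;
    value: string;
    type: 'metadata' | 'log message pattern';
    documentCount: number;
    baselineCount: number;
    logRateChangeSort: number;
    logRateChange: string;
  }>;
  // Bucket start timestamp (epoch ms, as a string key) mapped to document count.
  dateHistogramBuckets: Record<string, number>;
  windowParameters: {
    baselineMin: number;
    baselineMax: number;
    deviationMin: number;
    deviationMax: number;
  };
}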

O11y AI Assistant registered function

The analysis is also available as a registered function get_aiops_log_rate_analysis for the Observability AI Assistant. It registers a custom starter prompt in Discover that will then trigger the analysis for the data on display:

(Screenshot: custom starter prompt in Discover that triggers the analysis.)
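
As a rough sketch only, the registration might look like the following, assuming the Observability AI Assistant's registerFunction API and reusing names from the snippets quoted in the review below; `registerFunction`, `esClient`, `fetchLogRateAnalysis`, and the parameter schema are all assumptions, not the PR's code.

// Rough sketch, not the PR's code: `registerFunction`, `esClient` and
// `fetchLogRateAnalysis` are assumed to be in scope; the exact signature
// of the Observability AI Assistant's registration API may differ.
registerFunction(
  {
    name: 'get_aiops_log_rate_analysis',
    description:
      'Identifies significant field/value pairs in log data that contributed to changes in the log rate.',
    parameters: {
      type: 'object',
      properties: {
        index: { type: 'string' },
        start: { type: 'string' },
        end: { type: 'string' },
      },
      required: ['index', 'start', 'end'],
    } as const,
  },
  async ({ arguments: args }) => {
    const data = await fetchLogRateAnalysis({
      esClient,
      arguments: {
        index: args.index,
        start: args.start,
        end: args.end,
        timefield: '@timestamp',
      },
    });

    // Return shape based on the `key`/`description`/`data` payload shown
    // in the review discussion below.
    return {
      key: 'logRateAnalysis',
      description:
        'Significant field/value pairs in log data that contributed to changes in the log rate.',
      data,
    };
  }
);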


walterra self-assigned this Mar 29, 2024
walterra force-pushed the ml-aiops-public-api branch several times between Apr 5, 2024 (ede8504 to ec7811b) and Jun 18, 2024 (8590712 to c7a009f)
Comment on lines 76 to 85
export const fetchSimpleLogRateAnalysis = async (
  esClient: ElasticsearchClient,
  index: string,
  start: string,
  end: string,
  timefield: string,
  abortSignal?: AbortSignal,
  keywordFieldCandidates: string[] = [],
  textFieldCandidates: string[] = []
) => {
Member: Use named function with destructured params.
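
A hypothetical sketch of that suggestion (the options interface and import path are assumptions, not the PR's code):

import type { ElasticsearchClient } from '@kbn/core/server';

// Hypothetical options object, so optional arguments can simply be omitted
// instead of being passed as positional `undefined`s.
interface FetchSimpleLogRateAnalysisOptions {
  esClient: ElasticsearchClient;
  index: string;
  start: string;
  end: string;
  timefield: string;
  abortSignal?: AbortSignal;
  keywordFieldCandidates?: string[];
  textFieldCandidates?: string[];
}

export async function fetchSimpleLogRateAnalysis({
  esClient,
  index,
  start,
  end,
  timefield,
  abortSignal,
  keywordFieldCandidates = [],
  textFieldCandidates = [],
}: FetchSimpleLogRateAnalysisOptions) {
  // ... analysis as in the original implementation ...
}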

const pValuesQueue = queue(async function (payload: QueueFieldCandidate) {
  if (isKeywordFieldCandidate(payload)) {
    const { keywordFieldCandidate } = payload;
    let pValues: Awaited<ReturnType<typeof fetchSignificantTermPValues>> = [];
Member: Afaict you can remove this line and use const pValues = ... below.

Suggested change:
- let pValues: Awaited<ReturnType<typeof fetchSignificantTermPValues>> = [];

  esClient,
  params,
  [keywordFieldCandidate],
  undefined,
Member: This reads really poorly. What is undefined? It would be better with named (destructured) parameters, where this could be left out.

 * @param textFieldCandidates Optional text field candidates.
 * @returns Log rate analysis data.
 */
export const fetchSimpleLogRateAnalysis = async (
Member: Regarding the name fetchSimpleLogRateAnalysis: are there other types of log rate analysis that are not "simple" but advanced? Or offer other features?

@@ -82,6 +82,7 @@ export async function getApmServiceSummary({
apmAlertsClient: ApmAlertsClient;
logger: Logger;
}): Promise<ServiceSummary> {
console.log('SERVICE SUMMARY!!');
Member: debug

Comment on lines +26 to +31
entities: {
  'service.name'?: string;
  'host.name'?: string;
  'container.id'?: string;
  'kubernetes.pod.name'?: string;
};
Member: Are these unused?

Comment on lines +35 to +43
return await fetchLogRateAnalysis({
  esClient,
  arguments: {
    index,
    start: args.start,
    end: args.end,
    timefield: '@timestamp',
  },
});
Member: The response from fetchLogRateAnalysis is currently sent untransformed to the LLM. I suggest we remove any data that we don't expect the LLM to be able to understand, to reduce noise. significantItems seems like the most important part; windowParameters and dateHistogramBuckets, on the other hand, don't seem terribly important and we may want to omit them. WDYT @dgieselaar? For reference, the untransformed payload currently sent to the LLM looks like this:

{
  "key": "logRateAnalysis",
  "description": "Significant field/value pairs in log data that contributed to changes in the log rate.",
  "data": {
    "logRateChange": {
      "type": "spike",
      "timestamp": 1719409656000,
      "logRateChangeCount": 15581,
      "averageLogRateCount": 5795,
      "logRateAggregationIntervalUsedForAnalysis": "a few seconds",
      "documentSamplingFactorForAnalysis": 0.1,
      "extendedChangePoint": {
        "startTs": 1719409464000,
        "endTs": 1719410040000
      }
    },
    "significantItems": [],
    "dateHistogramBuckets": {
      "1719408840000": 5473,
      "1719408864000": 13756,
      "1719408888000": 14171,
      "1719408912000": 14903,
      "1719408936000": 15225,
      "1719408960000": 13459,
      "1719408984000": 13005,
      "1719409008000": 1269,
      "1719409032000": 395,
      "1719409056000": 284,
      "1719409080000": 391,
      "1719409104000": 376,
      "1719409128000": 335,
      "1719409152000": 357,
      "1719409176000": 321,
      "1719409200000": 282,
      "1719409224000": 381,
      "1719409248000": 305,
      "1719409272000": 371,
      "1719409296000": 268,
      "1719409320000": 314,
      "1719409344000": 344,
      "1719409368000": 234,
      "1719409392000": 0,
      "1719409416000": 0,
      "1719409440000": 0,
      "1719409464000": 8495,
      "1719409488000": 13564,
      "1719409512000": 13804,
      "1719409536000": 14125,
      "1719409560000": 13367,
      "1719409584000": 13748,
      "1719409608000": 14173,
      "1719409632000": 14410,
      "1719409656000": 15581,
      "1719409680000": 14873,
      "1719409704000": 14491,
      "1719409728000": 14086,
      "1719409752000": 12799,
      "1719409776000": 15265,
      "1719409800000": 15139,
      "1719409824000": 14291,
      "1719409848000": 14002,
      "1719409872000": 15321,
      "1719409896000": 14048,
      "1719409920000": 14248,
      "1719409944000": 14656,
      "1719409968000": 13425,
      "1719409992000": 13673,
      "1719410016000": 13690,
      "1719410040000": 8707,
      "1719410064000": 189,
      "1719410088000": 0,
      "1719410112000": 0,
      "1719410136000": 0,
      "1719410160000": 0,
      "1719410184000": 0,
      "1719410208000": 0,
      "1719410232000": 0,
      "1719410256000": 0,
      "1719410280000": 0,
      "1719410304000": 0,
      "1719410328000": 0,
      "1719410352000": 0,
      "1719410376000": 0,
      "1719410400000": 0,
      "1719410424000": 0,
      "1719410448000": 0,
      "1719410472000": 0,
      "1719410496000": 0,
      "1719410520000": 0,
      "1719410544000": 0,
      "1719410568000": 0,
      "1719410592000": 0,
      "1719410616000": 0,
      "1719410640000": 0
    },
    "windowParameters": {
      "baselineMin": 1719407664000,
      "baselineMax": 1719409464000,
      "deviationMin": 1719409488000,
      "deviationMax": 1719410040000
    }
  }
}
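
A minimal sketch of the suggested trimming (hypothetical, not code from the PR; esClient and args are assumed to be in scope as in the snippet above):

// Hypothetical: forward only the fields the LLM is expected to use,
// omitting windowParameters and dateHistogramBuckets to reduce noise.
const { logRateChange, significantItems } = await fetchLogRateAnalysis({
  esClient,
  arguments: {
    index,
    start: args.start,
    end: args.end,
    timefield: '@timestamp',
  },
});

return { logRateChange, significantItems };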


return {
  key: 'logRateAnalysis',
  description: `Significant field/value pairs in log data that contributed to changes in the log rate.`,
Member: In the description, can you elaborate a bit about what these logs are? Perhaps the index being analysed, the filters used, and the entities being analysed. And lastly, how to interpret the result, e.g. "significantItems indicates when a significant change happened for a particular dimension." This should help the LLM understand the result a bit better.
